I. Introduction

This project seeks to develop a robust model for analyzing and predicting college tuition rates across the United States. Using data from the Department of Education’s College Scorecard, U.S. News & World Report, and additional federal financial aid data for the 2022-2023 academic year, this analysis aims to shed light on the factors influencing tuition rates at Title IV-participating institutions.

The central question of this research is: What are the primary determinants of college tuition rates across various institutions in the U.S.?

This analysis integrates multiple data sources to create a comprehensive model of tuition rate dynamics. By applying regression techniques, managing missing data, and evaluating the impact of various predictors, this study aims to provide insights into college affordability that matter to prospective students and their families as they navigate the financial implications of higher education. Picking a school is a difficult decision, especially with so many to choose from; information on how institutional features and characteristics trend with tuition would be a valuable resource in that decision-making process.

This analysis builds on existing research in educational economics, such as Frazão (2019) and Xia and Niu (2022), which explores the economic factors influencing tuition rates and educational accessibility, by offering a more nuanced understanding of the sources of tuition variability. By better understanding the factors influencing tuition rates and building a robust predictive model of college affordability, this project aims to contribute to the broader discourse on affordability and provide actionable insights for stakeholders in the education sector.

II. Data Description

The data used in this analysis comprises information from the Integrated Postsecondary Education Data System (IPEDS) on institutions participating in Title IV programs. Among the data sourced are tuition measures, federal financial aid statistics, and institutional characteristics.

Source: Data was sourced from the Department of Education’s College Scorecard API and collected from the U.S. News & World Report’s college rankings using a custom Python script for data extraction.

The College Scorecard dataset is an aggregation of data about post-secondary educational institutions collected and provided by the United States’ Department of Education. The dataset contains data about the given institution, its students, and its costs. This data includes but is not limited to: location, fields of study offered, acceptance rates, graduation rates, loan information, etc.

The U.S. News & World Report Best Colleges Ranking is an annual set of rankings of U.S. colleges and universities, often described as the most influential institutional ranking in the country. To scrape from their database, I modified code by K. Chang to retrieve my desired variables from the relevant pages, and employed the BeautifulSoup package for parsing HTML and exporting the data.

Size: The original merged dataset contained 6,484 observations of 3,313 variables. The final cleaned dataset includes 1,433 institutions with 39 variables. A dictionary of features and definitions was provided by the Department of Education; I went through this dictionary and narrowed the features down to those relevant to my research question.

Further, I pared down the number of observations (institutions), filtering the data to include only 4-year institutions that primarily award undergraduate (Bachelor’s) degrees, based on the ICLEVEL and PREDDEG features, and removed observations with undisclosed Instate_Tuition values. These filters pared my total number of observations down from 6,484 to 1,915.

I considered 27 of these factorized and quantitative variables in my modeling.

Dataset Details

  • Original Dataset: 6,484 observations, 3,313 variables
  • Final Cleaned Dataset: 1,433 institutions, 39 variables

Variables: Variable OPE_ID was the primary identifier for linking and integrating the datasets. This unique identifier was crucial for resolving discrepancies in zip codes and other potentially inaccurate information, ensuring consistency and accuracy across the integrated data sources. Complementary identifying variables such as State, ZIP, LATITUDE, and LONGITUDE were valuable for situating the institutions and considering spatiality.

Variable Instate_Tuition was the primary response variable, representing the tuition rates charged to in-state students at various institutions. To address the skewness often present in tuition data and improve the model’s performance, it was eventually rescaled via a Box-Cox transformation into Instate_Tuition_transformed. This transformed variable provides a more normalized measure of tuition rates.

Several explanatory variables were considered in the analysis to assess their impact on tuition rates. These variables include Admission_Rate, SAT_Avg, and other demographic and financial indicators.

The data underwent several preprocessing steps to handle missing values, including imputation using k-nearest neighbors (kNN) and random forest methods.

I explicitly ensured that categorical variables like State, REGION, LOCALE, Pub_Priv, Relig_Affil, Minority_Serving_Category, and Stand_Test_Req were read as factors. This step ensured that the variables were treated correctly during analysis.

State: Represents the U.S. FIP code of the state where the institution is located.

REGION: Denotes the geographical region of the institution (e.g., Northeast, Midwest).

LOCALE: Describes the institution’s setting (e.g., city, suburb, rural).

Pub_Priv: Indicates whether the institution is public or private.

Relig_Affil: I turned this variable from a multilevel factor into a binary indicator specifying whether the institution is or is not religiously affiliated.

Minority_Serving_Category: Identifies whether the institution serves a significant number of minority students (e.g., HBCU, Hispanic-Serving); the factor level indicates which minority population.

Stand_Test_Req: Indicates whether standardized test scores were required for admission.

Very Low Enrollment Counts

The distribution of undergraduate enrollment, and a summary of the Undergrad variable, make clear that some institutions report exceptionally low enrollment. Based on the Carnegie Classification of Institutions of Higher Education, a university or college is considered “small” if it has fewer than 5,000 students, and “very small” if it has fewer than 1,000. To focus the analysis on institutions with more substantial enrollments, I excluded those in the bottom quartile of undergraduate enrollment in this dataset, corresponding to institutions with fewer than 646 enrolled students. This adjustment reduced the number of observations from 1,915 to 1,436.

Illogical Values

There were certain observations that reported nonsensical values for specific variables. Specifically, Thomas Edison State University reported $0 in instructional expense per full-time enrollee. Similarly, sixteen institutions including Brown University reported $0 in book and supply costs. In these cases I replaced the values with NAs, which I would later impute.

Invalid Rows
Name Instructional_Expenses_per_FTE Books_Supplies_Cost
Edward Waters University 3469 0
Young Harris College 7976 0
National Louis University 5793 0
Quincy University 6710 0
Eastern Kentucky University 9152 0
University of Maryland Global Campus 2694 0
Herzing University-Minneapolis 5511 0
Thomas Edison State University 0 1700
University of Tulsa 15551 0
Brown University 35131 0
National American University-Rapid City 1632 0
Trevecca Nazarene University 8893 0
University of Advancing Technology 5007 0
Provo College 5756 0
American Public University System 1785 0
Columbia Southern University 1546 0
Indiana Institute of Technology-College of Professional Studies 5619 0

This correlation plot displays the strength and direction of linear relationships between variables, which helps in exploring potential multicollinearity issues. I expected financial variables like Instructional_Expenses_per_FTE, Avg_Faculty_Salary, and Books_Supplies_Cost to form one cluster, and academic performance indicators like SAT_AVG and Completion_Rate to form another, and was surprised to see the clustering was not so clear. The correlation plot remains a valuable tool for exploring relationships between variables and identifying patterns that might inform further analysis.

Filtering Variables

I was optimistic about using the rankings provided by U.S. News & World Report; however, the rankings are clustered by School_Type, so that, for example, Princeton University was ranked #1 as a national university while Wellesley College was simultaneously ranked #1 as a liberal arts college, as were individual religious and local institutions. I attempted to mitigate this by filtering for “national_universities” and “liberal_arts_colleges” only, reasoning that a rank could be shared consistent with the literature on the subject, and by including an indicator variable Tied_Ranking and an interaction term between Rank and School_Type. However, so few ranked observations remained (fewer than 50% of the total), and their absence was certainly not random, so I forwent use of the Rank variable for the purposes of this analysis.

At this point, the dataset had come down from an unwieldy 3,313 variables to 39. The following table lists all the variable names in the college_data_clean dataset:

Variable Names in Cleaned Dataset
Variable Names
OPE_ID NonRes_Alien_Undergrads
Name.x Instate_Tuition
State Instructional_Expenses_per_FTE
ZIP Avg_Faculty_Salary
REGION Stand_Test_Req
LOCALE Completion_Rate
LATITUDE Retention_Rate
LONGITUDE Pct_Fed_Loans
Pub_Priv Pct_Pell_Grants
Admission_Rate Books_Supplies_Cost
SAT_AVG Minority_Serving_Category
Undergrads Relig_Affil
Male_Undergrads Name.y
Fem_Undergrads School_Type
White_Undergrads Rank
Black_Undergrads Tied_Rank
Hisp_Undergrads ACT_Avg
Asian_Undergrads HS_GPA_Avg
AIAN_Undergrads SAT_Avg
NHPI_Undergrads
Note:
Data provided by the Department of Education’s College Scorecard and U.S. News & World Report

The table below summarizes the central tendencies and dispersions of continuous variables. These statistics provide insights into the typical values and variability present in the dataset:

Summary Statistics for Selected Variables
Min 1st Qu. (25%) Median Mean 3rd Qu. (75%) Max
Instate_Tuition 1008 10052.2500 22325 26262.8078 38762.5000 66490
Admission_Rate 0.0269 0.6270 0.7700 0.7183 0.8752 1
Undergrads 647 1279.7500 2421.5000 5826.7305 6152.7500 138138
Fem_Undergrads 0 0.5192 0.5758 0.5769 0.6341 1
White_Undergrads 0 0.3824 0.5752 0.5244 0.7139 0.9950
Black_Undergrads 0 0.0436 0.0774 0.1385 0.1506 0.9844
Hisp_Undergrads 0 0.0578 0.1005 0.1566 0.1774 1
Asian_Undergrads 0 0.0129 0.0258 0.0511 0.0568 0.4983
AIAN_Undergrads 0 0.0012 0.0025 0.0061 0.0053 0.3235
NHPI_Undergrads 0 0.0004 0.0010 0.0025 0.0021 0.4355
NonRes_Alien_Undergrads 0 0.0101 0.0237 0.0420 0.0520 0.7566
Instructional_Expenses_per_FTE 1125 7456 9932 11834.3331 13310.5000 142185
Avg_Faculty_Salary 2112 7099.5000 8327 8819.1127 10138.7500 22761
Completion_Rate 0 0.4488 0.5759 0.5745 0.7002 1
Retention_Rate 0 0.6806 0.7539 0.7463 0.8293 1
Pct_Fed_Loans 0 0.3793 0.5048 0.5015 0.6364 0.9683
Pct_Pell_Grants 0 0.2507 0.3462 0.3635 0.4499 0.8895
Books_Supplies_Cost 100 1000 1200 1246.7718 1400 8470
SAT_AVG 850 1087.7500 1156 1183.0551 1264 1560

Missing Values

To address missing values in the dataset, an initial step was to identify which variables had the most significant amounts of missing data. This was accomplished by calculating the total number of missing values for each variable and determining the proportion of missing values relative to the total number of observations. A threshold was set to flag variables with more than one-third of values missing for further examination, chosen to balance retaining a substantial amount of data against the unreliability of heavily imputed variables. The only two variables exceeding it were Rank and HS_GPA_Avg, so I removed them from consideration for modeling.

III. Methodology

Certain types of missing data were easier to account for. Both U.S. News and the College Scorecard data included metrics on average SAT scores, and these appeared to be comparable. I conducted a t-test of means between the values of the two groups (private and public institutions) to see whether there was a statistically significant difference between them; there was not, so where College Scorecard information (SAT_AVG) was unavailable, I deferred to U.S. News (SAT_Avg), combining them into a new variable, Full_SAT. This brought me from about 500 missing values down to 300.
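The analysis itself was run in R; as a minimal sketch of the same idea, the following Python snippet runs a Welch-style two-sample t-test (which does not assume equal variances) on synthetic stand-ins for the two SAT measures. The sample sizes and distributions here are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two SAT measures (SAT_AVG from the
# Scorecard, SAT_Avg from U.S. News); the real analysis used the merged data.
scorecard_sat = rng.normal(1180, 120, size=400)
usnews_sat = rng.normal(1185, 125, size=350)

# Welch's t-test: does not assume equal variances between the two sources.
t_stat, p_value = stats.ttest_ind(scorecard_sat, usnews_sat, equal_var=False)

# A large p-value suggests the two measures are statistically comparable,
# so missing SAT_AVG values can reasonably be back-filled with SAT_Avg.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```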

Further, to determine whether the remaining missing data was “Missing Completely at Random” (MCAR), Little’s MCAR test was employed. This test assesses whether the probability of missing data is independent of the observed data. The null hypothesis of this test posits that the missing data is “missing completely at random.”

Little’s MCAR test, however, yielded a p-value of effectively zero. This led me to reject the null hypothesis, suggesting that the missing data was not missing completely at random. Consequently, the non-random nature of the missingness indicates that simply removing observations with missing values could introduce bias into the analysis, as those values are not ignorable. This finding highlights the necessity of more sophisticated methods for handling missing data, such as imputation, for the sake of subsequent regression analyses.

Value Imputation

To address the missing values within the dataset, I attempted two imputation techniques: k-nearest neighbors (KNN) and random forest imputation. The dataset college_data_clean was first preprocessed to exclude character variables irrelevant to numerical imputation, then KNN imputation was performed with various values of \(k\) (3, 5, 7, and 10) to assess the robustness of the imputed values.

First, I checked the usefulness of KNN imputation for handling missing SAT scores by visually comparing the distributions of the original and imputed data under different values of \(k\). Histograms of SAT scores from the original and imputed datasets were generated to evaluate how varying the number of neighbors affected the imputation. Ultimately, \(k = 7\) appeared to strike the best balance between over-fitting and under-prediction.
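The imputation was done in R; a rough Python sketch of the same sweep over \(k\), using scikit-learn’s KNNImputer on a toy numeric matrix (a hypothetical stand-in for the numeric columns of college_data_clean), looks like:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
# Toy numeric matrix with ~15% of entries deleted at random.
X = rng.normal(size=(200, 5))
mask = rng.random(X.shape) < 0.15
X_missing = X.copy()
X_missing[mask] = np.nan

# Impute with several neighborhood sizes, as in the analysis (k = 3, 5, 7, 10),
# and compare how far imputed values sit from the (here known) originals.
for k in (3, 5, 7, 10):
    X_imp = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
    print(f"k={k:2d}  imputation RMSE={rmse:.3f}")
```

On real data the originals are unknown, which is why the analysis compared distributions visually instead of computing an RMSE against truth.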

Subsequently, random forest imputation was applied using the missForest package. This approach was chosen to capture the complexity of relationships in the data with a more nuanced imputation.
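missForest is an R package; a rough scikit-learn analogue, under the assumption that an iterative imputer with a random-forest learner captures similar non-linear structure, can be sketched as:

```python
import numpy as np
# IterativeImputer is still marked experimental; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
# Toy data with ~10% missingness, standing in for the numeric columns.
X = rng.normal(size=(150, 4))
X[rng.random(X.shape) < 0.1] = np.nan

# Each variable with missing values is modeled as a function of the others,
# cycling until the imputations stabilize -- the idea behind missForest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0,
)
X_rf_imputed = imputer.fit_transform(X)
```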

Linear regression models with all reasonable variables were fitted using both of these imputed datasets to predict in-state tuition, and the resulting models were compared on their adjusted \(R^2\) values.

\(Y_i= \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+...+\beta_px_{ip}+\epsilon_i\).

\(\text{Instate_Tuition} = \beta_0 + \beta_1 \text{Pub_Priv} + \beta_2 \text{Undergrads} + \beta_3 \text{NonRes_Alien_Undergrads} + \beta_4 \text{Avg_Faculty_Salary} + \beta_5 \text{Completion_Rate} +\)

\(\beta_6 \text{Pct_Fed_Loans} + \beta_7 \text{Pct_Pell_Grants} + \beta_8 \text{School_Type} + \beta_9 \text{REGION} + \beta_{10} \text{Admission_Rate} + \beta_{11} \text{Hisp_Undergrads} +\)

\(\beta_{12} \text{Asian_Undergrads} + \beta_{13} \text{NHPI_Undergrads} + \beta_{14} \text{Instructional_Expenses_per_FTE} + \beta_{15} \text{Stand_Test_Req} + \beta_{16} \text{ACT_Avg} +\)

\(\beta_{17} \text{Books_Supplies_Cost} + \epsilon\)

The model based on data imputed with random forest yielded an adjusted \(R^2\) of 0.8721, while the model on KNN-imputed data had a slightly higher adjusted \(R^2\) of 0.8741. By contrast, a model based on the original, unimputed data delivered an adjusted \(R^2\) of 0.9122. The complete-case model thus had the highest adjusted \(R^2\), indicating superior explanatory power; however, it excluded 1,077 of the 1,436 observations, and Little’s MCAR test had already indicated that the missingness was not random, so this model was likely to obscure meaningful variation in the data. Its diagnostics, too, were not ideal (see Appendix: Fig. 1). I therefore proceeded with an imputed model.
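Adjusted \(R^2\) penalizes model size as well as fit, which is what makes it comparable across models with different predictor counts. As a quick reference, the computation can be sketched as follows (the \(R^2\), \(n\), and \(p\) values below are illustrative, not taken from the fitted models):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p predictors:
    1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustration with hypothetical values (n = 1,436 observations, 17 predictors):
print(adjusted_r2(0.90, n=1436, p=17))
```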

Although the adjusted \(R^2\) of the KNN (\(k=7\)) model was higher than that of the random-forest imputation, the diagnostic plots of the KNN model flagged four points with leverage values of one. The random-forest model had only two leverage values of one and a very similar adjusted \(R^2\), so I proceeded with it as the dataset for my subsequent models.

Still, the diagnostics for this random-forest imputed model were not ideal, and the fact that there were two leverage values of one revealed concerns regarding the model’s assumptions and the influence of specific observations.

Leverage Values and Distortionary Points

To identify high leverage points, which can disproportionately affect the results of a regression model and skew analysis, I calculated the leverage threshold using the formula: \(h_i>2(\frac{p+1}{n})\) where \(p\) is the number of parameters and \(n\) is the sample size. For my model, the threshold was computed as ≈0.090.
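The hat-value computation behind this threshold can be sketched directly in Python (the design matrix below is a toy stand-in; the real one comes from the imputed dataset):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5                      # n observations, p predictors (toy sizes)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Leverage (hat) values are the diagonal of H = X (X'X)^{-1} X'.
hat = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

threshold = 2 * (p + 1) / n        # the rule-of-thumb cutoff used above
high_leverage = np.where(hat > threshold)[0]
print(f"threshold={threshold:.3f}, flagged {len(high_leverage)} points")
```

A useful sanity check: the hat values always sum to the number of fitted coefficients, here \(p + 1\).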

A plot of hat values revealed observations with leverage values exceeding this threshold. The two observations with leverage values of exactly one were the Rose-Hulman Institute of Technology (Observation 315) and the United States Merchant Marine Academy (Observation 759). Additionally, the University of Guam (observation 1240) exhibited notable leverage. Cook’s distance \(D_i = \frac{t_i^2}{p+1}(\frac{h_i}{1-h_i})\) at this observation was notably high at 0.8526, and it fell outside of the 0.5 contour on the Residuals vs Leverage plot, justifying its exclusion from the dataset.

I also plotted hat values from the RF imputed model and identified two other high-leverage points: observations 263 (Moody Bible Institute of Chicago) and 383 (The Southern Baptist Theological Seminary), both with leverage values of 0.5153. While this was above our threshold, there was nothing else about these two values that indicated cause for concern, so I retained them.

Moran’s I

The case of the University of Guam was particularly fascinating, as I had not until this point engaged with the spatiality of these points, and naturally, an observation so spatially far removed from all the others would be far removed in other meaningful respects as well. To get a sense of where the institutions in the dataset are situated, I mapped them, with a color gradient indicating their in-state tuition. This revealed just how clustered the institutions in the dataset are, and led me to compute a local Moran’s I measure of spatial autocorrelation.

Moran’s I is a statistic that measures the correlation between each institution’s tuition amount and those of its neighboring institutions, defined mathematically as:

\[ I = \frac{N}{S_0} \frac{\sum_{i=1}^N \sum_{j=1}^N w_{i,j} z_i z_j}{\sum_{i=1}^N z_i^2} \]

where:

  • \(N\) is the number of locations (institutions), indexed by \(i\) and \(j\),
  • \(w_{i,j}\) is a matrix describing neighboring connections,
  • \(S_0\) is the total weight of the neighboring connections (i.e., the sum of all \(w_{i,j}\)),
  • \(z_i\) is the deviation of the tuition value at location \(i\) from the mean.
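The original computation was done in R; a minimal Python sketch of the global statistic, on an invented four-location example with a binary contiguity weight matrix, is:

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I: values is length-N, W an N x N spatial weight matrix."""
    z = values - values.mean()      # deviations from the mean
    S0 = W.sum()                    # total weight of neighbor connections
    N = len(values)
    return (N / S0) * (z @ W @ z) / (z @ z)

# Toy example: 4 locations on a line, rook-style contiguity weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
tuition = np.array([10_000.0, 12_000.0, 40_000.0, 42_000.0])
print(morans_i(tuition, W))  # positive: similar values neighbor each other
```

Values near +1 indicate clustering of similar tuitions, values near -1 indicate a checkerboard of dissimilar neighbors, and values near 0 indicate no spatial pattern.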

Likewise, Local Indicators of Spatial Association (LISA) is another measure of local spatial autocorrelation, but unlike Moran’s I, which provides a single global measure of spatial association, LISA calculates spatial autocorrelation for each location individually.

The Local Moran’s I statistic for a location \(i\) is defined as:

\[ I_i = \frac{(z_i - \bar{z}) \sum_{j=1}^N w_{i,j} (z_j - \bar{z})}{\frac{1}{N} \sum_{j=1}^N (z_j - \bar{z})^2} \]

where:

  • \(w_{i,j}\) is the spatial weight between locations \(i\) and \(j\),
  • \(z_i\) is the value at location \(i\),
  • \(\bar{z}\) is the mean of the values across all locations,
  • \(N\) is the total number of locations.

LISA Results often reveal spatial patterns:

  1. High-High Clusters: Locations with high values surrounded by other high values.
  2. Low-Low Clusters: Locations with low values surrounded by other low values.
  3. High-Low Outliers: Locations with high values surrounded by low values.
  4. Low-High Outliers: Locations with low values surrounded by high values.
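Reusing the toy four-location example from the global statistic, the per-location version can be sketched as (again, a Python illustration of what the R analysis computed):

```python
import numpy as np

def local_morans_i(values, W):
    """Local Moran's I (LISA) for each location; W is an N x N weight matrix."""
    z = values - values.mean()
    N = len(values)
    m2 = (z @ z) / N                # variance term in the denominator
    return (z / m2) * (W @ z)       # I_i = z_i * (weighted neighbor sum) / m2

W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
tuition = np.array([10_000.0, 12_000.0, 40_000.0, 42_000.0])
lisa = local_morans_i(tuition, W)

# Positive I_i: location i sits in a high-high or low-low cluster;
# negative I_i: a high-low or low-high outlier relative to its neighbors.
print(lisa)
```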

I was somewhat surprised to observe rather small absolute LISA values, but visualizing them revealed notable patterns of high-low outliers, with strongly negative Local Moran’s I (LISA) values, in areas like Chicago, Massachusetts, Connecticut, and Southern California, usually at prestigious private liberal arts schools: Wesleyan, Brandeis, Trinity, Wellesley, Vassar, and Pepperdine among others. The low LISA values for these expensive, often private institutions likely reflect their status as outliers, unique or atypical compared to their immediately neighboring institutions. The clustering of negative Local Moran’s I values in these coastal regions suggests that these areas have distinctive spatial patterns in tuition costs. Investigating regional factors, policies, and economic conditions would likely yield greater insight into why these patterns exist and how they reflect broader educational and economic dynamics.

I excluded observations 315, 759, and 1240 from the dataset and refitted the regression model. This recalibrated model (model_rf_imputed2) included 1,433 observations and was subjected to the same diagnostic checks to evaluate improvements in model fit and robustness.

The updated diagnostic plots for model_rf_imputed2 were analyzed to assess whether the exclusion of influential observations improved the model’s adherence to regression assumptions and overall fit (See: Fig. 4).

Multivariate Regression Inference

As diagnostic plots of this latest model still revealed issues with linearity, and to an extent, normality and heteroskedasticity, I elected to transform the response variable, Instate_Tuition to better achieve normality and stabilize variance. To this end, I used a Box-Cox transformation to find the optimal transformation. The Box-Cox transformation is a family of power transformations that are used to stabilize variance and make the data more closely meet the assumptions of a linear regression model, defined for a given response variable \(Y\) as:

\[ Y(\lambda) = \begin{cases} \frac{Y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(Y) & \text{if } \lambda = 0 \end{cases} \]

For this analysis, a Box-Cox transformation was applied to the Instate_Tuition variable with \(\lambda = 0.1818182\). This specific value of \(\lambda\) was chosen to best normalize the distribution of the Instate_Tuition variable and to meet the assumptions of the linear regression model. The transformation is expressed as: \[ \text{Instate_Tuition_transformed} = \frac{(\text{Instate_Tuition}^{0.1818182} - 1)}{0.1818182} \]
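The transformation and its inverse, using the \(\lambda\) reported above, can be sketched as follows (scipy.stats.boxcox could estimate \(\lambda\) by maximum likelihood, as R’s boxcox does; here it is taken as given):

```python
import numpy as np

LAMBDA = 0.1818182  # lambda estimated in the analysis (approximately 2/11)

def boxcox_transform(y, lam=LAMBDA):
    """Box-Cox power transform; lam = 0 falls back to the log transform."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1) / lam

def boxcox_inverse(y_t, lam=LAMBDA):
    """Back-transform a value to the original tuition scale."""
    return (lam * np.asarray(y_t, dtype=float) + 1) ** (1 / lam)

# Min, median, and max in-state tuition from the summary table above.
tuition = np.array([1008.0, 22325.0, 66490.0])
print(boxcox_transform(tuition))
```

The inverse is useful for translating model predictions back into dollar amounts.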

This transformation helps stabilize variance and make the data more symmetric. Here, it improved the model’s adjusted \(R^2\) from 0.872 to 0.893, enhancing fit and somewhat mitigating issues of non-linearity and heteroscedasticity. While the diagnostics for this model (see: Fig. 5) were still not ideal, I determined they were reasonably and sufficiently linear, normal, and homoskedastic.

Additional transformations (log, square root, and cubic) were also explored, with their effectiveness in improving model assumptions and fit evaluated via diagnostic plots assessing linearity, normality, and homoscedasticity (Figs. 6-8, respectively). While the cubic transformation was promising, it remained inferior to the Box-Cox transformation.

\(\text{Instate_Tuition}^3 = \beta_0 + \beta_1 \text{Pub_Priv} + \beta_2 \text{Undergrads} + \beta_3 \text{NonRes_Alien_Undergrads} + \beta_4 \text{Avg_Faculty_Salary} + \beta_5 \text{Completion_Rate}\)

\(+\beta_6 \text{Pct_Fed_Loans} + \beta_7 \text{Pct_Pell_Grants} + \beta_8 \text{School_Type} + \beta_9 \text{REGION} + \beta_{10} \text{Admission_Rate} + \beta_{11} \text{Hisp_Undergrads} +\)

\(\beta_{12} \text{Asian_Undergrads} + \beta_{13} \text{NHPI_Undergrads} + \beta_{14} \text{Instructional_Expenses_per_FTE} + \beta_{15} \text{Stand_Test_Req} + \beta_{16} \text{ACT_Avg} + \beta_{17}\) \(\text{Books_Supplies_Cost} + \epsilon\)

Further, I checked the model for multicollinearity using its variance inflation factors (VIF). The standardized generalized variance inflation factors, \(GVIF^{1/(2 \cdot Df)}\), were all between one and four, revealing moderate collinearity between several variables, though not to a troubling degree. This measure standardizes the GVIF across variables with different numbers of levels; raising it to the power \(1/(2 \cdot Df)\) makes the values comparable to traditional VIF measures, so I assessed them against similar thresholds.

At this point it was worth considering a higher-order transformation of one of the explanatory variables, and Instructional_Expenses_per_FTE was a potentially appropriate candidate for a polynomial transformation, as its relationship with the response appears curved in a way that suggests non-linearity. The scatterplot and LOESS curve below suggest such a transformation might accommodate a complex relationship between Instructional_Expenses_per_FTE and Instate_Tuition_transformed. The LOESS (Locally Estimated Scatterplot Smoothing) curve was overlaid as a smoothing technique to help identify underlying patterns.

To this end, I implemented a custom function to loop through a regression of Instructional_Expenses_per_FTE raised to the 1st-6th degrees. This function generated formulas for polynomial regression and fit these models to the data. To evaluate the performance of these polynomial models, I employed a 5-fold cross-validation approach to determine the optimal polynomial degree based on the root mean square error (RMSE) \(\sqrt{\frac{1}{K}\sum_{j=1}^K{MSE_j}}\).

Polynomial degrees ranging from 1 to 6 were tested. For each degree, the polynomial model was fit, and the cross-validated RMSE was recorded. The degree that minimized the RMSE was selected as the best-performing polynomial degree. Ultimately, although a fourth-degree polynomial transformation for Instructional_Expenses_per_FTE was deemed optimal, it did not significantly enhance the model’s predictive performance. Therefore, while higher-order terms were explored to address potential remaining non-linearity in diagnostic plots, the practical benefit in this context was limited, as incorporating such a transformation did not provide substantial improvement over simpler models.
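The analysis implemented this search as a custom R function; the same degree-by-degree, cross-validated RMSE comparison can be sketched in Python with scikit-learn (the data below are synthetic stand-ins for Instructional_Expenses_per_FTE and the transformed tuition):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(4)
# Synthetic predictor (think: instructional expenses in $1,000s) with a
# mildly curved relationship to a synthetic response.
x = rng.uniform(1, 60, size=300).reshape(-1, 1)
y = 10 + 0.3 * x.ravel() - 0.002 * x.ravel() ** 2 + rng.normal(0, 1, 300)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
rmse_by_degree = {}
for degree in range(1, 7):
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree),
                          LinearRegression())
    scores = cross_val_score(model, x, y, cv=kf,
                             scoring="neg_root_mean_squared_error")
    rmse_by_degree[degree] = -scores.mean()

best_degree = min(rmse_by_degree, key=rmse_by_degree.get)
print(f"best degree by 5-fold CV RMSE: {best_degree}")
```

Standardizing before expanding the polynomial keeps the higher-degree columns numerically well behaved.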

Subset Selection

Returning to my Box-Cox transformed, and Random Forest-imputed linear model, and with limited other options in the way of mitigating non-linearity, I decided to proceed with reservations, beginning by testing whether the model was capturing anything meaningful in its full, bloated form.

I performed an analysis of variance (ANOVA) to test the overall significance of the model. The extremely small p-value (\(< 2.2 \times 10^{-16}\)) provides strong evidence against the null hypothesis that the model terms do not significantly explain the variability in the transformed in-state tuition. In other words, at least some of the predictors make a meaningful contribution to explaining the variation in Instate_Tuition_transformed, validating that the predictors used in the model are not all negligible. I therefore proceeded with more careful tuning of model parameters and model selection.

Exhaustive Cp-Selection

To identify the best subset of predictors, I first employed an exhaustive approach, examining all \(2^p - 1\) possible subset models using the regsubsets function from the leaps package. This returns both the model that maximizes adjusted \(R^2\) and the one that minimizes Mallows’ Cp; in this case, these were the same model. The following variables were selected by both criteria: Pub_Priv, Undergrads, NonRes_Alien_Undergrads, Avg_Faculty_Salary, Completion_Rate, Pct_Fed_Loans, Pct_Pell_Grants, and School_Type. This model had an adjusted \(R^2\) of 0.8761. Visually, linearity was far improved in this reduced model relative to prior iterations of the full model. Normality and homoskedasticity appeared marginally improved, if not much different, but still not concerning (see: Fig. 9).
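The exhaustive search that regsubsets performs can be mimicked directly for a small predictor set; the sketch below (toy data, with \(\hat{\sigma}^2\) estimated from the full model, as in Mallows’ Cp) enumerates every subset and keeps the one with the smallest Cp:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, total_p = 300, 6
X = rng.normal(size=(n, total_p))
# Only the first three predictors truly matter in this toy setup.
y = 1.0 + 2 * X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0, 1, n)

def fit_rss(X_sub, y):
    """Residual sum of squares from an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return resid @ resid

# sigma^2 is estimated from the full model, per the Cp definition.
sigma2 = fit_rss(X, y) / (n - total_p - 1)

best = None
for k in range(1, total_p + 1):
    for subset in itertools.combinations(range(total_p), k):
        rss = fit_rss(X[:, subset], y)
        cp = rss / sigma2 - n + 2 * (k + 1)   # Mallows' Cp
        if best is None or cp < best[0]:
            best = (cp, subset)
print(f"best subset by Cp: {best[1]} (Cp = {best[0]:.2f})")
```

For the full 17-plus-variable problem this brute force grows as \(2^p\), which is why leaps uses a branch-and-bound search instead.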

Stepwise Selection Tuned to AIC

Next, I performed stepwise selection using the Akaike Information Criterion (AIC), \(2p - 2\log L(\hat{\theta})\), to refine the model. The final stepwise model included all of the covariates selected in the Cp model, as well as the following additional variables: REGION, Admission_Rate, Hisp_Undergrads, Asian_Undergrads, NHPI_Undergrads, Instructional_Expenses_per_FTE, Stand_Test_Req, ACT_Avg, and Books_Supplies_Cost. This larger model returned a higher adjusted \(R^2\) of 0.8919. The change in variables did not drastically alter the state of linearity, normality, and homoskedasticity (see: Fig. 10); again, linearity appeared much improved in comparison with the earlier, more complex full models, the other assumptions of normality and homoskedasticity still appeared reasonably met, and there were no evident distortionary points.
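A forward-only sketch of the stepwise idea (the analysis’ stepwise procedure may also drop terms) greedily adds whichever predictor most lowers the AIC, stopping when no addition helps; the data here are synthetic:

```python
import numpy as np

def ols_aic(X, y):
    """AIC for a Gaussian OLS fit (up to an additive constant)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Xd.shape[1]

rng = np.random.default_rng(6)
n, p = 300, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, n)   # only 0 and 1 matter

selected, remaining = [], list(range(p))
current_aic = ols_aic(X[:, []], y)                # intercept-only start
while remaining:
    # Try adding each remaining predictor; keep the one that lowers AIC most.
    trials = [(ols_aic(X[:, selected + [j]], y), j) for j in remaining]
    best_aic, best_j = min(trials)
    if best_aic >= current_aic:
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current_aic = best_aic
print(f"selected predictors: {sorted(selected)}")
```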

LASSO-Selection

Finally, I employed LASSO (least absolute shrinkage and selection operator) regression. LASSO performs variable selection by shrinking some coefficients to zero, while retaining variables that provide even a small amount of predictive power by shrinking their coefficients without eliminating them entirely. The best lambda value for the LASSO penalty was identified through cross-validation via the glmnet package. Predictably, the final LASSO model included still more variables beyond those retained by both of the prior models; the additional covariates were White_Undergrads, LOCALE, Retention_Rate, Minority_Serving_Category, and Relig_Affil. This model yielded an adjusted \(R^2\) of 0.8917. Diagnostics for linearity, normality, and homoskedasticity looked much the same as those for the other two reduced models (see: Fig. 11).
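The analysis used glmnet in R; an equivalent sketch with scikit-learn’s LassoCV, on synthetic data, shows the cross-validated penalty choice and which coefficients survive the shrinkage:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n, p = 300, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)  # only 0 and 1 matter

# Standardize so the L1 penalty treats all predictors comparably; the
# penalty weight (glmnet's lambda, sklearn's alpha) is then chosen by
# 5-fold cross-validation, mirroring cv.glmnet.
X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

kept = np.where(lasso.coef_ != 0)[0]
print(f"chosen alpha={lasso.alpha_:.4f}, retained predictors: {kept}")
```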

To then compare these three reduced models, I conducted t-tests and ANOVA F-tests to assess whether additional variables significantly improved the model fit, and compared the AIC and BIC values for the models.

First, I performed pairwise t-tests comparing the means of the residuals to determine if there was a significant difference in the fit of the three models. I found, however, that the means of the residuals for cp_mod, aicstepwise_model, and final_lasso_model were not significantly different from each other, as the p-values were extremely high (approximately 1).

Next, I conducted ANOVA F-tests between the models to see if the additional variables significantly improved model fit. The ANOVA F-test compares models based on their residual sum of squares (RSS) and degrees of freedom.

Here, the aicstepwise_model, which included additional variables, provided a significantly better fit than the Cp-selected model (final_mod): the p-value was extremely small, indicating that the added variables significantly improved the model.

The final_lasso_model, with its additional variables, also showed a significant improvement over cp_mod.

There was no significant difference, however, between aicstepwise_model and final_lasso_model in terms of fit, as suggested by the high p-value; the additional variables in final_lasso_model did not provide a statistically significant improvement over aicstepwise_model. Thus, while both aicstepwise_model and final_lasso_model significantly outperformed cp_mod in fit, there is no statistically significant difference between the two of them. For the sake of interpretability and parsimony, then, the model derived by stepwise AIC selection is the optimal choice thus far.

I then checked the AIC and BIC values for each model. Again, the stepwise model provided the best fit, as indicated by the lowest AIC and BIC values. Thus, the stepwise AIC model outperformed the others, with the highest adjusted \(R^2\) and the lowest AIC and BIC.

Finally, I performed 5-fold cross-validation to assess the predictive performance of the models. Upon applying the training control to each of the three models, the stepwise AIC model again showed superior performance, with the lowest Root Mean Squared Error of 1.54, in contrast with 1.60 for the CP model and 1.56 for the LASSO model.
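The 5-fold procedure partitions the data into five folds, fits on four, scores RMSE on the held-out fold, and averages the five fold scores. A minimal pure-Python sketch using a trivial mean-only baseline in place of the regression (the real analysis used caret's training control in R; the data here are made up):

```python
import math
import random

def kfold_rmse(y, k=5, seed=0):
    """Cross-validated RMSE for a mean-only baseline 'model': each fold is
    held out once while the mean of the remaining data serves as predictor."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)           # random partition of indices
    folds = [idx[i::k] for i in range(k)]
    fold_rmses = []
    for held_out in folds:
        held = set(held_out)
        train = [y[i] for i in idx if i not in held]
        pred = sum(train) / len(train)          # the 'fitted model'
        mse = sum((y[i] - pred) ** 2 for i in held_out) / len(held_out)
        fold_rmses.append(math.sqrt(mse))
    return sum(fold_rmses) / k                  # average RMSE across folds

cv_rmse = kfold_rmse([float(v % 7) for v in range(50)])
```

Swapping the mean predictor for each candidate regression gives the comparison reported above: the model with the lowest cross-validated RMSE generalizes best.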

I am now confident that the stepwise-selected AIC model is the best of these three reduced models. It outperformed the more parsimonious CP model in adjusted \(R^2\), AIC, BIC, and the F-test; it was no worse than the more complex LASSO model in the same F-test despite being more parsimonious, while beating the LASSO model on adjusted \(R^2\), AIC, and BIC. Thus, I will move forward with this model.

We are left, then, with an instate tuition response variable that has been transformed as such: \[ \text{Instate_Tuition_transformed} = \frac{(\text{Instate_Tuition}^{0.1818182} - 1)}{0.1818182} \]
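Concretely, with \(\lambda \approx 0.1818\), the transformation and its inverse, which is needed later to express predictions back in dollars, can be sketched as follows (Python used here purely for illustration; the analysis applied the Box-Cox machinery in R):

```python
LAMBDA = 0.1818182  # Box-Cox lambda selected earlier in the analysis

def boxcox(y, lam=LAMBDA):
    """Box-Cox transform: (y^lam - 1) / lam, for lam != 0."""
    return (y ** lam - 1) / lam

def inv_boxcox(z, lam=LAMBDA):
    """Inverse transform, used to map predictions back to dollar tuition."""
    return (z * lam + 1) ** (1 / lam)

# Round trip: transforming and inverting recovers the original tuition.
tuition = 8989.0
assert abs(inv_boxcox(boxcox(tuition)) - tuition) < 1e-6
```

Note that model coefficients are therefore in transformed "units," not dollars, which is why the interpretations below speak of unit changes.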

and with seventeen explanatory variables: Pub_Priv, Undergrads, NonRes_Alien_Undergrads, Avg_Faculty_Salary, Completion_Rate, Pct_Fed_Loans, Pct_Pell_Grants, School_Type, REGION, Admission_Rate, Hisp_Undergrads, Asian_Undergrads, NHPI_Undergrads, Instructional_Expenses_per_FTE, Stand_Test_Req, ACT_Avg, and Books_Supplies_Cost.

Model Comparison Table

| Metric | Value | Comments |
|---|---|---|
| Adjusted R-squared (cp_mod) | 0.8761 | Lowest adjusted R-squared indicates worst fit: CP |
| Adjusted R-squared (aicstepwise_model) | 0.8919 | Highest adjusted R-squared indicates best fit: AIC |
| Adjusted R-squared (final_lasso_model) | 0.8917 | Better than CP, worse than AIC model: LASSO |
| t-test p-value (cp vs aic) | 1.0000 | No significant difference in residuals’ means. |
| t-test p-value (cp vs lasso) | 1.0000 | No significant difference in residuals’ means. |
| t-test p-value (aic vs lasso) | 1.0000 | No significant difference in residuals’ means. |
| ANOVA F-test p-value (cp vs aic) | 0.0000 | AIC model fits significantly better than CP. |
| ANOVA F-test p-value (cp vs lasso) | 0.0000 | LASSO model fits significantly better than CP. |
| ANOVA F-test p-value (aic vs lasso) | 0.3806 | No significant difference in fit: AIC and LASSO. |
| AIC (cp) | 5388.73 | Higher value indicates worse fit: CP |
| AIC (aicstepwise) | 5265.64 | Lowest value indicates best fit: AIC |
| AIC (lasso) | 5281.11 | Better than CP, worse than AIC model: LASSO |
| BIC (cp) | 5515.15 | Higher value indicates worse fit: CP |
| BIC (aicstepwise) | 5481.60 | Lowest value indicates best fit: AIC |
| BIC (lasso) | 5591.90 | Better than CP, worse than AIC model: LASSO |
| CV RMSE (cp) | 1.5995 | Highest RMSE indicates worst fit: CP |
| CV RMSE (aicstepwise) | 1.5391 | Lowest RMSE indicates best fit: AIC |
| CV RMSE (lasso) | 1.5606 | Better than CP, worse than AIC model: LASSO |

Model Evaluation

ANOVA (Analysis of Variance) is used to assess the overall significance of the model, testing the null hypothesis that the model’s terms do not significantly explain the variability in the response variable; it validates whether the predictors used in the model are all negligible or not. The ANOVA table p-values assess the contribution of each predictor to the overall model fit, testing whether each predictor significantly improves the fit compared to the model without it, given the contribution of all other predictors. These p-values show whether adding a particular predictor (or a whole group of predictors) significantly improves the model relative to the baseline model without those predictors.

On the other hand, the p-values from the summary table are related to the coefficients of the predictors in the regression model, and test whether each coefficient is significantly different from zero, given the other predictors in the model. Thus, these p-values assess the significance of each predictor’s coefficient individually, indicating whether each predictor has a statistically significant effect on the response variable, holding other predictors constant. They’re useful for understanding the strength of covariates’ relationships with the response variable.

In this case, specifically, I am curious about the ANOVA p-values of my reduced model, and why I have retained some covariates with quite high summary p-values, like Admission_Rate. I therefore performed an ANOVA with a custom output table for the aicstepwise_model to evaluate the contribution of each predictor in the model.

A variable like Undergrads has a high ANOVA F-test p-value (0.647) but a very low summary p-value, which leads me to suspect some degree of multicollinearity; this makes logical sense, as I have also included demographic proportions, such as the racial composition of undergraduates, in the model. While the variable alone may not explain a significant portion of the variance in tuition, it plays a significant role when considered with the other predictors in the regression model.

Admission_Rate, on the other hand, has a summary-table p-value of 0.0869, significant at an alpha level of 0.1 but not at the more conventional 0.05 level. Coupled with its ANOVA F-test p-value of 0.8875, this indicates that Admission_Rate neither significantly contributes to the model when considered in isolation nor contributes much overall to explaining the variance in tuition.

ANOVA P-values for Covariates in aicstepwise_model

| Covariate | P-Value |
|---|---|
| Pub_Priv | 0.0000000 |
| Undergrads | 0.6473092 |
| NonRes_Alien_Undergrads | 0.0000000 |
| Avg_Faculty_Salary | 0.0000000 |
| Completion_Rate | 0.0000000 |
| Pct_Fed_Loans | 0.0000000 |
| Pct_Pell_Grants | 0.0000000 |
| School_Type | 0.0000000 |
| REGION | 0.0000000 |
| Admission_Rate | 0.8874653 |
| Hisp_Undergrads | 0.0000001 |
| Asian_Undergrads | 0.0010325 |
| NHPI_Undergrads | 0.0083622 |
| Instructional_Expenses_per_FTE | 0.0023384 |
| Stand_Test_Req | 0.0196010 |
| ACT_Avg | 0.0000003 |
| Books_Supplies_Cost | 0.0616579 |
| Residuals | NA |

Interaction Term

Why, then, might both the stepwise AIC model and the LASSO model retain Admission_Rate? I was curious whether the variable might be interacting with other predictors in the model, contributing to the tuition value in a way that is evident only when these interactions are considered. I therefore checked for an interaction effect.

To this end, I ran an interaction test model: I regressed the same transformed tuition value on the same selected covariates, but this time interacted each of them with Admission_Rate. In this way I forced interaction terms into my stepwise AIC model to test whether the relationship between Admission_Rate and Instate_Tuition_transformed depends on other covariates.

I found significant interaction terms (low p-values) for Undergrads (p = 0.0064) and Instructional_Expenses_per_FTE (p = 0.0000035), which was fascinating.

To determine whether this performed better than the model without interaction terms, I compared this new potentially valuable model which included the interaction terms found significant in the previous test, with the previously selected stepwise AIC model.

  • Adjusted R-squared:
    • aicstepwise_model: 0.8919
    • model_with_interactions: 0.8941

The adjusted R-squared value increased from 0.8919 to 0.8941, indicating that the model with interaction terms explains slightly more of the variance in Instate_Tuition_transformed than the model without them.

  • AIC (Akaike Information Criterion):
    • aicstepwise_model: 5274.001
    • model_with_interactions: 5245.778

The AIC decreased from 5274.001 to 5245.778, suggesting that the model with interaction terms fits the data better; a lower AIC indicates a better trade-off between goodness of fit and complexity.

Including these interaction terms adheres to the hierarchical principle, ensuring that all lower-order terms (main effects) are included in the model alongside their interaction terms. The approach also provides a more nuanced understanding of the factors influencing instate tuition at different institutions.

Based on the analysis, it’s evident that certain interaction effects significantly improve the model’s fit to the tuition data.

For these reasons, I elect to keep this model with interaction terms as my final reduced model. Again, assumptions of linearity, normality, and homoskedasticity were reasonable (see: Fig. 12).

IV. Results

Now, the final reduced model for instate tuition is:

\[\begin{equation} \text{Instate_Tuition_transformed} = \beta_0 + \beta_1 \text{Pub_Priv} + \beta_2 \text{Undergrads} + \beta_3 \text{Admission_Rate} + \beta_4 (\text{Undergrads} \times \text{Admission_Rate}) + \beta_5 \text{NonRes_Alien_Undergrads} + \beta_6 \text{Avg_Faculty_Salary} + \beta_7 \text{Completion_Rate} + \beta_8 \text{Pct_Fed_Loans} + \beta_9 \text{Pct_Pell_Grants} + \beta_{10} \text{School_Type} + \beta_{11} \text{REGION} + \beta_{12} (\text{Admission_Rate} \times \text{Instructional_Expenses_per_FTE}) + \beta_{13} \text{Hisp_Undergrads} + \beta_{14} \text{Asian_Undergrads} + \beta_{15} \text{NHPI_Undergrads} + \beta_{16} \text{Instructional_Expenses_per_FTE} + \beta_{17} \text{Stand_Test_Req} + \beta_{18} \text{ACT_Avg} + \beta_{19} \text{Books_Supplies_Cost} + \epsilon \end{equation}\]

Cross-validation

It is important, however, to cross-validate the performance of this model to ensure its robustness. After partitioning the dataset into randomly shuffled training and test subsets, I trained and then validated the model on separate data, ensuring that promising findings were not a result of overfitting.

Data Partitioning: The dataset was shuffled and partitioned into training (80%) and test (20%) subsets.

  • Once partitioned into these two distinct subsets, my final model (with interaction terms) was rerun on the training subset.

Model Prediction: I then used this “trained model” to predict instate tuition on the test subset.

Validation: From this partitioned process of modeling and testing, I computed performance metrics to assess how well the model performed on “unfamiliar” data. Specifically, I calculated the RMSE (Root Mean Squared Error) of the predictions by manually taking the square root of the mean squared error between predicted and observed tuition values: \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\).
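The test-set metrics reported below follow directly from the predicted and observed values. A hedged pure-Python sketch of the computations (the observed/predicted numbers here are toy values on the transformed tuition scale, not the real data):

```python
import math

def rmse(observed, predicted):
    """Root Mean Squared Error: sqrt of the mean squared residual."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean Absolute Error: mean of the absolute residuals."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Toy check with hypothetical values on the transformed tuition scale.
obs, pred = [10.0, 12.0, 11.0], [9.0, 12.5, 11.5]
assert rmse(obs, pred) >= mae(obs, pred)  # always holds
```

RMSE penalizes large errors more heavily than MAE, which is why the two are reported together below.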

This RMSE evaluated to approximately 1.65. Although there is no precise rule of thumb for an acceptable RMSE range, lower is better, and I consider this a convincingly low value on the transformed tuition scale. These metrics indicate that the model performs well on the test data, producing accurate predictions.

## RMSE: 1.652168
## MAE: 1.134817
## R-squared: 0.859615

Coefficient Estimates

Finally, then, the table below presents the coefficients of my final reduced model, including their estimates, standard errors, t-values, and p-values. These coefficients provide insights into the relationship between the predictors and the transformed instate tuition.

Coefficients of the Final Model with Interactions

| Term | Estimate | P Value |
|---|---|---|
| (Intercept) | 13.5363561 | 0.0000000 |
| Pub_Priv2 | 6.0533216 | 0.0000000 |
| Pub_Priv3 | 4.4110929 | 0.0000000 |
| Undergrads | -0.0001110 | 0.0000000 |
| Admission_Rate | -0.8789093 | 0.0336953 |
| NonRes_Alien_Undergrads | 8.5364750 | 0.0000000 |
| Avg_Faculty_Salary | 0.0002323 | 0.0000000 |
| Completion_Rate | 1.7272416 | 0.0000683 |
| Pct_Fed_Loans | 4.6413534 | 0.0000000 |
| Pct_Pell_Grants | -4.7431155 | 0.0000000 |
| School_Typehealth | 1.7810479 | 0.1106748 |
| School_Typenational-liberal-arts-colleges | 4.7622733 | 0.0000088 |
| School_Typenational-universities | 4.1328049 | 0.0000964 |
| School_Typeregional-colleges-midwest | 4.1861574 | 0.0001111 |
| School_Typeregional-colleges-north | 3.6885889 | 0.0008568 |
| School_Typeregional-colleges-south | 4.4491751 | 0.0000376 |
| School_Typeregional-colleges-west | 3.3887029 | 0.0023296 |
| School_Typeregional-universities-midwest | 3.7299847 | 0.0004538 |
| School_Typeregional-universities-north | 4.0341798 | 0.0001644 |
| School_Typeregional-universities-south | 4.3004833 | 0.0000548 |
| School_Typeregional-universities-west | 3.8394442 | 0.0003601 |
| School_Typespecial-focus | 3.9802127 | 0.0046023 |
| School_Typespecialty-arts | 4.0415706 | 0.0007363 |
| School_Typespecialty-business | 3.2294721 | 0.0309634 |
| REGION2 | -0.1198805 | 0.5155299 |
| REGION3 | 0.1145863 | 0.6378957 |
| REGION4 | -0.3138381 | 0.2288340 |
| REGION5 | -0.4689905 | 0.0411891 |
| REGION6 | 0.1561071 | 0.5719899 |
| REGION7 | -0.1620077 | 0.6421494 |
| REGION8 | 0.0176923 | 0.9473885 |
| REGION9 | -3.8762455 | 0.0000000 |
| Instructional_Expenses_per_FTE | 0.0000018 | 0.8131438 |
| Hisp_Undergrads | 1.9393640 | 0.0000060 |
| Asian_Undergrads | 0.7132051 | 0.4921246 |
| NHPI_Undergrads | -4.7838820 | 0.6594059 |
| Stand_Test_Req3 | 0.2640471 | 0.2235904 |
| Stand_Test_Req5 | 0.2795082 | 0.1482670 |
| ACT_Avg | 0.1324020 | 0.0000000 |
| Books_Supplies_Cost | 0.0001487 | 0.1403526 |
| Undergrads:Admission_Rate | 0.0000803 | 0.0004135 |
| Admission_Rate:Instructional_Expenses_per_FTE | 0.0000731 | 0.0002516 |

Being a private institution (Pub_Priv2) increases instate tuition by approximately 6.05 units relative to a public institution, holding all other variables constant. Likewise, institutions classified as another type of private institution (Pub_Priv3) see an increase of about 4.41 units compared to public institutions, all other factors held constant.

Undergrads: For each additional undergraduate student, instate tuition decreases by approximately 0.000111 units, all other factors held constant. This indicates a very slight negative relationship, where larger student bodies are associated with marginally lower tuition costs.

Admission_Rate: A one-unit increase in the admission rate is associated with a decrease in instate tuition of approximately 0.879 units, all other factors held constant. This suggests that as institutions become less selective, their tuition tends to decrease.

NonRes_Alien_Undergrads: A one-unit increase in the proportion of non-resident alien undergraduates is associated with an increase in instate tuition of approximately 8.54 units, all other factors held constant. This suggests that institutions with larger international-student shares tend to charge more instate tuition.

Avg_Faculty_Salary: An increase in average faculty salary of one unit is associated with an increase in tuition of approximately 0.000232 units, all other factors held constant, indicating that institutions with higher average faculty salaries tend to charge higher tuition.

Completion_Rate: A one-unit increase in completion rate is associated with an increase in instate tuition of approximately 1.73 units, all other factors held constant. This suggests that institutions with higher completion rates tend to charge higher tuition.

Pct_Fed_Loans: A one-unit increase in the share of students receiving federal loans is associated with an increase in instate tuition of approximately 4.64 units, all other factors held constant, implying that institutions with higher proportions of federal-loan recipients charge higher tuition.

Pct_Pell_Grants: A one-unit increase in the share of Pell Grant recipients is associated with a decrease in tuition of approximately 4.74 units, all other factors held constant. This suggests that higher proportions of Pell Grant recipients are linked with lower tuition costs.

School_Type: Relative to the reference category, “Unranked and Other Institutions,” and holding all else constant:

National-liberal-arts-colleges: tuition is approximately 4.76 units higher.

National-universities: about 4.13 units higher.

Regional-colleges-midwest: approximately 4.19 units higher.

Regional-colleges-north: about 3.69 units higher.

Regional-colleges-south: approximately 4.45 units higher.

Regional-colleges-west: about 3.39 units higher.

Regional-universities-midwest: about 3.73 units higher.

Regional-universities-north: approximately 4.03 units higher.

Regional-universities-south: about 4.30 units higher.

Regional-universities-west: approximately 3.84 units higher.

Special-focus: about 3.98 units higher.

Specialty-arts: approximately 4.04 units higher.

Specialty-business: about 3.23 units higher.

REGION9: Institutions in Puerto Rico had a decrease in tuition of about 3.88 units compared to the reference category, the Northeast, holding all else constant. REGION5 (the Southeast) also showed a significant decrease of about 0.47 units (p = 0.041); the remaining regional indicators were not statistically significant at the 0.05 level.

Hisp_Undergrads: A one-unit increase in the proportion of Hispanic undergraduates is associated with an increase in instate tuition of approximately 1.94 units, holding all else constant.

ACT_Avg: Each additional point in the average ACT score is associated with an increase in instate tuition of approximately 0.132 units, holding all else constant.

Undergrads:Admission_Rate: A one-unit increase in this interaction term is associated with an increase in tuition of approximately 0.0000803 units. This suggests that the effect of the number of undergraduates on tuition varies with the admission rate.

Admission_Rate:Instructional_Expenses_per_FTE: A one-unit increase in this interaction term is associated with an increase in tuition of approximately 0.0000731 units, indicating that the relationship between admission rate and tuition is influenced by instructional expenses. Resource-allocation strategies likely vary with an institution’s selectivity.

Prediction Scenarios

I wish to know what tuition a student who wished to attend a competitive public national university in the Southeast, with a cohort size of around 20,000 and no testing requirements, could have expected in 2022. By specifying these conditions in a new data frame (Pub_Priv==1, Undergrads==20000, REGION==5, Admission_Rate==0.17, Stand_Test_Req==5, etc.), passing that data frame to the predict function with this model, and finally inverting the Box-Cox transformation, I find that such a student (had they enrolled in 2022) could have expected tuition of about $7,637.

This nearly describes the University of North Carolina at Chapel Hill, which itself reported instate tuition of $8,989 in 2022. Under the specified conditions, the model provides a reasonable approximation to the actual reported tuition, supporting the model’s predictive capability and highlighting its utility for estimating tuition under specified conditions.

Next, I am interested in predicting what the tuition would have been for a student who wished to attend an even more competitive private liberal arts school in the northeast with a cohort size of around 7,000, and no testing requirements. I will also imagine that this is an international student who would like a considerable international-student community around them.

By specifying these conditions in a new data frame (Pub_Priv==2, Undergrads==7000, REGION==1, Admission_Rate==0.1, Stand_Test_Req==5, NonRes_Alien_Undergrads==0.2, etc.), passing that data frame to the predict function with this model, and finally inverting the Box-Cox transformation of the response, I find that such a student (had they enrolled in 2022) could have expected tuition of about $37,809 on average.

Besides the region, this nearly describes Carnegie Mellon University. However, it reported tuition of about $61,344 in 2022, notably higher than my model’s prediction.

Several factors might explain this discrepancy:

The model may not fully capture all of the nuances that contribute to tuition rates at highly competitive institutions like Carnegie Mellon University, such as institutional prestige, high demand, and specialized programs, which could push tuition higher.

This model was built on a diverse set of institutions and may not fully generalize to highly selective institutions with pricing strategies and financial structures unlike more typical institutions. The data used for modeling might not include specific variables or detailed characteristics that are unique to high-cost private institutions, such as specific program costs or institutional endowments. These might influence tuition rates but are not included in the current model.

In summary, while the model provides a reasonable estimate for average tuition based on the specified conditions, it’s important to consider additional institutional-specific factors when predicting tuition values.

It is also worth revisiting the effect of the interaction term between, for example, Admission_Rate and Undergrads, which suggests that the relationship between the number of undergraduates at an institution and the tuition amount is influenced by the institution’s selectivity. This can be seen to an extent in the plot of tuition as a function of admission rate and undergraduate count.

V. Conclusion

This study has aimed to better understand the determinants of in-state tuition rates by employing a robust model that attempts to capture the most influential institutional characteristics. The findings reveal significant insights into how different factors contribute to tuition costs in the US, and which factors contribute very little.

Being a private institution, for example, is associated with a substantial increase in in-state tuition, approximately 6.05 units higher than at public institutions. Relatedly, institutions categorized under the other private classification (Pub_Priv3) also see a 4.41-unit increase in tuition relative to public institutions, other variables held constant. These results suggest a notable premium attached to private institutions, reflecting factors such as additional amenities and services that may justify the higher tuition.

Further, a nuanced relationship is observed between student enrollment and tuition. An increase in the number of undergraduates is associated with a marginal decrease in tuition of approximately 0.000111 units, indicating that larger student bodies may benefit from economies of scale, leading to slightly lower tuition rates. Additionally, a higher admission rate correlates with lower tuition, with each unit increase in admission rate associated with a decrease of about 0.879 units. This finding aligns with the notion that less selective institutions may offer lower tuition rates.

The study also considers the impact of international students and faculty salaries on tuition. A one-unit increase in the proportion of non-resident alien undergraduates is linked to an increase in tuition of approximately 8.54 units, suggesting that institutions with larger international-student populations charge more. Similarly, each unit increase in average faculty salary is associated with a 0.000232-unit rise in tuition, reflecting higher expenditure on competitive faculty compensation.

Furthermore, institutions with higher completion rates and larger shares of federal-loan recipients tend to charge higher tuition, with increases of approximately 1.73 and 4.64 units per unit increase, respectively. Conversely, a greater share of Pell Grant recipients correlates with a reduction in tuition of about 4.74 units per unit increase. These relationships indicate that institutions’ financial aid profiles and completion rates track with varying tuition levels.

Regional differences also play a significant role in tuition costs. For example, institutions in Puerto Rico exhibit lower tuition rates by approximately 3.88 units compared to those in the Northeast. Additionally, the study highlights variations in tuition across institutional types and regions, providing valuable context for understanding regional and institutional cost structures.

Despite these insights, the study acknowledges limitations, particularly in capturing the full range of factors affecting highly selective institutions such as Carnegie Mellon University. The model may not account for unique elements like institutional prestige and specialized programs, which could contribute to higher tuition. Future research should aim to address these gaps by incorporating additional variables and exploring institutional-specific characteristics in greater depth.

While the interaction effects of Admission_Rate with Undergrads and with Instructional_Expenses_per_FTE did not yield especially large coefficients, both were highly significant, a fascinating but logically rational finding. Institutions at different selectivity levels show different patterns in how the number of undergraduates influences tuition; for instance, a highly selective institution with a low admission rate might follow a different pricing strategy than a less selective one.

While it makes logical sense that, for example, larger institutions (with more undergraduates) may adjust their tuition differently based on how selective they are, this is still a novel finding that I had neither anticipated nor encountered in my admittedly cursory review of the literature on the subject.

It is worth noting, too, that this analysis is based on institutions subject to Title IX, which adhere to federal regulations ensuring gender equity in education. It is therefore also worth considering how Title IX compliance impacts tuition rates.

The implications of this research extend to policy and institutional decision-making by revealing trends in educational accessibility and affordability. By improving the accuracy of tuition predictions and understanding the role of various drivers, I believe this study has been a valuable contribution to existing literature. Further and future exploration into the impact of Title IX compliance and other institutional factors on tuition rates would contribute to a still more comprehensive understanding of higher education costs.

VI. References

U.S. Department of Education. (n.d.). College Scorecard.

Integrated Postsecondary Education Data System (IPEDS). (n.d.). IPEDS.

VII. Appendix

Diagnostic plots for untransformed, unpared full models, imputed and unimputed:

plot(model, main = "Fig. 1: Diagnostic Plots for Original Unimputed Regression Model")

Clear bisection of observations and a downward bow are apparent in the Residuals vs. Fitted plot; linearity is dubious. The Q-Q Residuals plot likewise does not follow the trend expected under full normality, with observations at the tails off the trendline. The Scale-Location plot is not kinked or bowed, but points do not appear entirely randomly scattered. The Residuals vs. Leverage plot does not indicate notable distortion.

plot(knn_imputed_mod, main = "Fig. 2: Diagnostic Plots for KNN Imputed Regression Model")

Same issues as in the unimputed diagnostic plots: the bisection and downward bow remain apparent in the Residuals vs. Fitted plot, so linearity is still dubious. The Q-Q Residuals plot likewise does not follow the trend expected under full normality, though it is perhaps an improvement on the unimputed model; observations at the lower tail are still not fully on the trendline. The Scale-Location plot appears perhaps more random, though almost kinked; not too concerning. The Residuals vs. Leverage plot now indicates observation 1240 as lying within the contour bands for Cook's distance values of 0.5 and 1.

plot(model_rf_imputed, main = "Fig 3: Diagnostic Plots for Random Forest Imputed Regression Model")

Nearly identical to the KNN-imputed model’s diagnostic plots: the bisection and downward bow remain apparent in the Residuals vs. Fitted plot, so linearity is still dubious. The Q-Q Residuals plot likewise does not follow the trend expected under full normality, though it is perhaps an improvement on the unimputed model; observations at the lower tail are still not fully on the trendline. The Scale-Location plot appears perhaps more random, though almost kinked; not too concerning. The Residuals vs. Leverage plot again indicates observation 1240 as lying within the contour bands for Cook's distance values of 0.5 and 1.

Diagnostic plots for RF-Imputed Model with Distortionary Points Removed

plot(model_rf_imputed2, main = "Fig. 4: Diagnostic Plots for Updated RF-Imputed Model")

The Residuals vs. Leverage plot is certainly much improved. However, issues remain with linearity and, to a lesser extent, with normality and heteroskedasticity.

Diagnostic Plots for Response-Transformed Imputed Models

plot(model_transformed, main = "Fig. 5: Box Cox Transformation")

plot(model_log, main = "Fig. 6: Log Transformation")

plot(model_sqrt, main = "Fig. 7: Square Root Transformation")

plot(model_cubic, main = "Fig. 8: Cubic Transformation")

Diagnostic Plots for Reduced Models

plot(cp_mod, main = "Fig. 9: Mallow's CP Minimizing Model")

Linearity appears much improved with this reduced model compared with the prior, more complex models. The assumptions of normality and homoskedasticity still appear reasonably met, and there are no evident distortionary points.

plot(aicstepwise_model, main= "Fig. 10: Stepwise AIC-selected Model")

Again, linearity appears much improved with this reduced model compared with the prior, more complex models. The assumptions of normality and homoskedasticity still appear reasonably met, and there are no evident distortionary points.

plot(final_lasso_model, main="Fig. 11: LASSO-Reduced Model")

Much the same as prior two reduced models. Assumptions are still met, with linearity improved but still the most tenuous among them.

plot(model_with_interactions, main="Fig. 12: Stepwise AIC Model with Interaction Terms")

Again, visually, this is much the same as prior reduced models. Assumptions are still met, with linearity improved but still the most tenuous among them.